ABSTRACT



A technique for injecting corpus-based preference into 
syntactic text parsing is provided. Specifically, the 
problem of tagging content-word pairs by part-of- 
speech is solved by using thematic analysis. A new 
measure of the fixed or variable nature of such word 
pairs is created and used to classify word pairs as either 
noun-verb, adjective-noun, or verb-noun. 

4 Claims, 1 Drawing Sheet 
METHOD FOR TAGGING COLLOCATIONS IN
TEXT 



The complex scope of the pre-processing task is best 
illustrated by the input to the preprocessor shown be- 
low. 



BACKGROUND OF THE INVENTION

Sentences in a typical newspaper story include idi- 
oms, ellipses, and ungrammatical constructs. Since au- 
thentic language defies text-book grammar, the basic 
parsing paradigm must be tuned to the nature of the text
under analysis. 

Hypothetically, parsing could be performed by one
huge unification mechanism as described in the literature:
S. Schieber, "An Introduction to Unification-based
Approaches to Grammar", Center for the Study of
Language and Information, Palo Alto, Calif., 1986 and
M. Tomita, "Efficient Parsing for Natural Language",
Kluwer Academic Publishers, Hingham, Mass., 1986.
Such a mechanism would receive its tokens in the form
of words, characters, or morphemes, negotiate all given
constraints, and produce a full chart with all possible
interpretations.

However, when tested on a real corpus (i.e., Wall
Street Journal (WSJ) news stories), this mechanism
collapses. For a typical well-behaved 33-word sentence
it produces hundreds of candidate interpretations.

To alleviate problems associated with processing real
text, a new strategy has emerged. A preprocessor
capitalizing on statistical data has been described in the
literature: K. Church, W. Gale, P. Hanks, and D. Hindle,
"Parsing, Word Associations, and Predicate-Argument
Relations", Proceedings of the International
Workshop on Parsing Technologies, Carnegie Mellon
University, 1989 and I. Dagan, A. Itai, and U. Schwall,
"Two Languages are More Informative Than One",
Proceedings of the 29th Annual Meeting of the Association
for Computational Linguistics, Berkeley, Calif.,
1991. Such a preprocessor is trained to exploit properties of
the corpus itself, highlights regularities, identifies thematic
relations, and in general, feeds digested text into
the unification parser.

Consider the following WSJ (Aug. 19, 1987) paragraph
processed by a preprocessor:
Separately, Kaneb Services spokesman/nn said/vb
holders/nn of its Class A preferred/jj stock/nn
failed/vb to elect two directors to the company/nn
board/nn when the annual/jj meeting/nn resumed/vb
Tuesday because there are questions as to
the validity of the proxies/nn submitted/vb for
review by the group.
The company/nn adjourned/vb its annual/jj meeting/nn
May 12 to allow/vb time/nn for negotiations
and expressed/vb concern/nn about future/jj
actions/nn by preferred/jj holders/nn.
The problem which the present invention is intended 
to solve is the classification of content-word pairs into 
one of the following three categories. 
1. and expressed/VB concern/NN

2. Services spokesman/NN said/VB

3. class A preferred/JJ stock/NN



The constructs expressed concern and spokesman
said must be tagged verb-object and noun-verb respectively.
Preferred stock, on the other hand, must be identified
and tagged as a fixed adjective-noun construct.



This lexical analysis of the sentence is based on the 
Collins on-line dictionary plus morphology. Each word 
is associated with candidate parts of speech, and almost 
all words are ambiguous. The tagger's task is to resolve
the ambiguity. 

A program can bring to bear 3 types of clues in resolving
part-of-speech ambiguity. The first is local context.
Consider the following 2 cases where local context
dominates:

1. the preferred stock raised 

2. he expressed concern about 

The words the and he dictate that preferred and ex- 
pressed are adjective and verb respectively. This kind 
of inference, due to its local nature, is captured and 
propagated by the preprocessor. 
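
The two cases above can be written as a tiny rule table. The following Python sketch is illustrative only: the word lists and the function name `tag_by_local_context` are assumptions, not part of the described preprocessor, and it covers only an ambiguous "-ed" form preceded by a determiner or a subject pronoun.

```python
# Hypothetical word lists; a real preprocessor would draw on a lexicon.
DETERMINERS = {"the", "a", "an"}
SUBJECT_PRONOUNS = {"he", "she", "it", "we", "they"}

UNTAGGED = "??"  # returned when no local rule applies


def tag_by_local_context(previous_word: str, word: str) -> str:
    """Tag an ambiguous '-ed' word using only the word before it."""
    if not word.lower().endswith("ed"):
        return UNTAGGED
    previous = previous_word.lower()
    if previous in DETERMINERS:
        return "JJ"  # "the preferred stock": adjective reading
    if previous in SUBJECT_PRONOUNS:
        return "VB"  # "he expressed concern": verb reading
    return UNTAGGED  # e.g. "and preferred stock": no local evidence
```

A pair that comes back untagged is exactly the kind of case that needs evidence beyond the neighboring word.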

The second clue is global context. Global-sentence
constraints are shown by the following two examples:

1. and preferred stock sold yesterday was . . . 

2. and expressed concern about . . . °period°

In case 1, a main verb is found (i.e., was), and preferred
is taken as an adjective; in case 2, a main verb is not
found, and therefore expressed itself is taken as the main
verb. This kind of ambiguity requires full-fledged unification,
and it is not handled by the preprocessor. Fortunately,
only a small percentage of the cases (in newspaper
stories) depend on global reading.

The third type of clue
is corpus analysis and is described in R. Beckwith,
"Wordnet: A Lexical Database Organized in Psycholinguistic
Principles" in Lexical Acquisition: Exploiting
On-Line Dictionary to Build a Lexicon, Lawrence Erlbaum
Assoc., 1991.

SUMMARY OF THE INVENTION 

In accordance with the present invention, a method is 
provided for performing part of speech tagging for 
content-word pairs in a natural language text processing 
system. Content-word pairs are first identified in a large 
corpus of text used for training purposes. For each 
word pair identified, a variability factor is calculated. 
This variability factor is a measure of the variability of
the form that the particular word pair takes in the training
text. A database of all of the content-word pairs and
their associated variability factors is created for use by
a program which performs tagging of a body of text.
This database provides additional information in the
form of the variability factors which can be used in 
conjunction with other known tagging methods such as 
local context analysis. 

In another embodiment of the present invention, a 
mutual information score is used to control which word 
pairs occurring in the training text are to be stored in the 
database. 
BRIEF DESCRIPTION OF THE DRAWING

While the novel features of the invention are set forth
with particularity in the appended claims, the invention,
both as to organization and content, will be better
understood and appreciated, along with other objects and
features thereof, from the following detailed description
taken in conjunction with the drawing, in which:

The sole FIGURE is a schematic diagram which
shows the elements of the present invention.

DESCRIPTION OF THE INVENTION

The focus of the present invention is on how to exploit
the preferences encountered in corpus analysis
using thematic analysis (analysis of word relationships).
Referring now to FIG. 1, there is shown a corpus of
text 112 to be used for training. Trainer 114 is a computer
program which reads corpus 112 and creates
collocation database 116 using an algorithm which is
described below. Once collocation database 116 is in
place the actual tagging process may begin. An input
device 122 is used for entering text to be tagged. The
text is first processed by local context tagger 124 which
uses local context rules to tag words. When no rule
applies for tagging a word, then the word is tagged "??"
(meaning "untagged"). When local context tagging is
completed, the text is next processed by thematic analysis
tagger 126. Thematic analysis tagger 126 uses database
116 to tag the word-pairs left untagged by tagger
124. Word pairs are either tagged as fixed collocations
or thematic relations depending on a variability factor
associated with the word pair in the database. Parser
130 is standard text parsing software which accepts as
input the marked up text as processed according to the
method just described.

1. If a word pair is a collocation (e.g., holding companies),
and one of the two words is tagged "??",
then generate the S-stripped version (i.e., holding
company), and the affix-stripped version (i.e., hold
company).

2. Look up database.
(a) If neither collocation is found, then do nothing.
(b) If only affix-stripped collocation is found, or
if VF (variability factor) is smaller than threshold,
then tag first word a verb and the second word a
noun;
(c) if VF is larger than threshold, then tag adjective-noun
or noun-noun (depending on lexical properties
of word, i.e., running vs. meeting).

Checking for the noun-verb case is symmetrical (in
step 2.b). The threshold is different for each suffix and
should be determined experimentally (initial threshold
[illegible]). Local context rules override corpus preference.
Thus, although preferred stocks is a fixed construct,
in a case such as John preferred stocks, the algorithm
will identify preferred as a verb. Upon completion
of thematic analysis tagging, the text is in condition
to be passed to parser 130 which performs parsing on
the tagged text.

The algorithm yields incorrect results in two problematic
cases.

The first is ambiguous thematic relations which are
collocations that entertain both subject-verb and verb-object
relations, i.e., selling-companies (as in "the company
sold its subsidiary ..." and "he sold companies . . . ").

The second is interference between coinciding collocations
such as: market-experience and marketing-experience,
or ship-agent and shipping-agent. Fortunately,
these cases are very infrequent.

Adjectives and nouns are difficult to distinguish in
raw corpus (unless they are marked as such lexically).
For example, since the lexicon marks light as both adjective
and noun, there is no visible difference in the
corpus between light/JJ beer and light/NN bulb. The
present algorithm tags both light cases as a noun.

The example below illustrates the use of a fixed and a
variable collocation in context, and motivates the need
for thematic analysis. In this small sample, 8 out of 35
cases (the ones marked "-") cannot be resolved reliably
by using local context only. Without using thematic
analysis, a tagger will produce arbitrary tags for taking
and operating.



o latest version of the UNIX V operating system software and some -
th Microsoft 's MS °slash° DOS operating system °period° Microsoft -
ties obtained licenses for the operating system °period° With the +
nths before IBM can provide an operating system that taps its mach +
°comma° much as Microsoft 's operating system software is now th +
°hyphen° Telegraph Co. 's UNIX operating system °comma° fast becom -

homeowner 's refinancing to take advantage of lower interest ra +
g complete pc systems °period° Taking advantage of their lower °hy -
rally came from investors who took advantage of rising stock pric +
Existing statistical taggers which rely on bigrams or
trigrams, but which do not employ thematic analysis of
individual collocations fare poorly on this linguistic 
aspect. 

A database of collocations must be put in place in 
order to perform thematic analysis. Ideally, the data- 
base is acquired by counting frequencies over a tagged 
corpus. However, a sufficiently large tagged corpus is 
not available. To acquire an adequate database of collocations,
the full 85-million WSJ corpus is needed. It is
necessary to infer the nature of combinations from indi- 
rect corpus-based statistics as shown below. 

The basic linguistic intuition of the present invention
is presented below. 
Training over the corpus requires inflectional morphology
(analysis of word roots). For each collocation
P the following formula is applied to calculate P's Variability
Factor (assume the collocation P is produced
Verb-Noun Relations
387 expressed-concern 72
expressed-concerns 22
260
159

Noun-Verb Relations
118 analysts-note 51
192 analysts-noted 8
192 analysts-noted 2
13 analysts-noting
79 analyst-noted
6 analyst-notes
6 analyst-notes
6 analyst-notes
9 analyst-noting

Adjective-Noun Constructs
3558 preferred-stock 2
11 preferred-stocks 627
86
2
2
took-advantage

spokesman-acknowledged
operating-systems
Frequencies of each variant in the WSJ corpus are
shown. For example, joint venture takes 3 variants to- 
taling 4300 instances, out of which 4288 are concen- 
trated in 2 patterns, which in effect (stripping the plural 
"S" suffix), are a single pattern. For produce car no 
single pattern holds more than 21% of the cases. Thus, 
when more than 90% of the phrases are concentrated in 
a single pattern, it is classified as a fixed adjective-noun 
(or noun-noun) phrase. Otherwise, it is classified as a 
noun-verb (or verb-noun) thematic relation. 
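
The 90% rule above can be sketched in a few lines of Python. This is a hedged illustration, not the trainer's actual code: the helper names, the naive plural stripping, the split of the 4288 instances between singular and plural, and the name of the third variant are all assumptions; only the totals (4300 and 4288) and the 90% cutoff come from the text.

```python
from collections import Counter
from typing import Mapping


def strip_plural_s(collocation: str) -> str:
    """Drop a trailing plural 's' from the second word (naive, for illustration)."""
    first, second = collocation.split("-", 1)
    if second.endswith("s") and not second.endswith("ss"):
        second = second[:-1]
    return f"{first}-{second}"


def dominant_pattern_share(variant_counts: Mapping[str, int]) -> float:
    """Share of all instances that fall in the most frequent S-stripped pattern."""
    patterns = Counter()
    for variant, count in variant_counts.items():
        patterns[strip_plural_s(variant)] += count
    return max(patterns.values()) / sum(patterns.values())


def classify(variant_counts: Mapping[str, int], cutoff: float = 0.90) -> str:
    """'fixed' if one pattern dominates, otherwise 'thematic'."""
    return "fixed" if dominant_pattern_share(variant_counts) > cutoff else "thematic"


# Hypothetical split of the 4300 instances; only the totals are from the text.
joint_venture = {"joint-venture": 3000, "joint-ventures": 1288, "joint-venturing": 12}
print(classify(joint_venture))  # dominant share is 4288 / 4300
```

A pair whose instances are spread over many inflected forms, as with produce car, falls below the cutoff and is treated as a thematic relation.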



Where fW(plural(P)) means the word frequency of the
plural form of the collocation; fW(singular(P)) means
the frequency of the singular form of the collocation; and
fR(stemmed(P)) means the frequency of the stemmed
collocation.
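
The formula itself did not survive in this text. One reading that is consistent with the three quantities just defined, and with the joint venture figures quoted earlier (4288 of 4300 instances in the singular and plural forms), is the following; treat it as a reconstruction under those assumptions, not as the patent's wording:

```latex
\mathrm{VF}(P) \;=\;
  \frac{f_W(\mathrm{plural}(P)) + f_W(\mathrm{singular}(P))}
       {f_R(\mathrm{stemmed}(P))},
\qquad
\mathrm{VF}(\text{joint-venture}) \approx \frac{4288}{4300} \approx 1.00
```

Under this reading a value near 1 means that almost every occurrence of the stemmed pair appears in a single surface form, which is the signature of a fixed collocation.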

Accordingly, VF (producing-car) = VF (producing- 
cars)=0.32; and VF (produce-car) is (by coincidence) 
0.32. In contrast, VF (joint-venture) is 1.00. A list of the 
first 38 content-word pairs encountered in a test corpus 
is shown below. 
The frequency of each collocation P in the corpus relative
to its stem frequency is shown. The ratio, called
VF, is given in the first column. The second and third
columns present the collocation and its frequency. The
fourth and fifth columns present the stemmed collocation
and its frequency. The sixth column presents the
mutual information score (MIS). The MIS is calculated
by dividing the number of occurrences of the collocation
by the number of times each individual word in the
collocation occurs alone. During training, collocations
with MIS values below a selected threshold may be
ignored.
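
The description above gives the score only in words. A common way to realize such a score is pointwise mutual information, in the style of the Church et al. work cited earlier. The Python sketch below assumes that formulation (log base 2, counts normalized by corpus size) and hypothetical function names; it may differ from what trainer 114 actually computes.

```python
import math


def mutual_information_score(pair_count: int, first_count: int,
                             second_count: int, corpus_size: int) -> float:
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) ).

    Assumed formulation; the text only says the pair count is divided by
    the counts of the individual words.
    """
    if min(pair_count, first_count, second_count, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    joint = pair_count / corpus_size
    independent = (first_count / corpus_size) * (second_count / corpus_size)
    return math.log2(joint / independent)


def keep_for_database(score: float, threshold: float) -> bool:
    """Collocations scoring below the selected threshold may be ignored."""
    return score >= threshold
```

Filtering on this score before computing VF keeps accidental word pairs out of the collocation database.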

Notice that fixed collocations are easily distinguishable
from thematic relations. The smallest VF of a fixed
collocation is 0.86 (finance specialist); the
largest VF of a thematic relation is 0.56 (produce concrete).
Thus, a threshold, say 0.75, can effectively be
established.
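
Taken together with the look-up steps 1 and 2(a) to 2(c) described earlier, the threshold gives a simple decision rule. The Python sketch below is one possible reading: the function name, the dictionary standing in for collocation database 116, and the treatment of a VF exactly equal to the threshold are assumptions; the VF values are the ones quoted in the text.

```python
from typing import Dict, Optional, Tuple


def tag_untagged_pair(s_stripped: str, affix_stripped: str,
                      vf_database: Dict[str, float],
                      threshold: float = 0.75) -> Optional[Tuple[str, str]]:
    """Return (first_tag, second_tag) for a pair, or None if nothing is known."""
    vf = vf_database.get(s_stripped)
    if vf is None:
        if affix_stripped in vf_database:
            return ("VB", "NN")  # step 2(b): only the affix-stripped form is found
        return None  # step 2(a): neither form is found, do nothing
    if vf < threshold:
        return ("VB", "NN")  # step 2(b): variable pair, verb-noun relation
    # Step 2(c): fixed pair. Choosing JJ or NN for the first word would
    # depend on lexical properties; this sketch always returns JJ.
    return ("JJ", "NN")


# VF values quoted in the text; the dictionary layout is hypothetical.
vf_database = {"joint-venture": 1.00, "producing-car": 0.32, "produce-car": 0.32}
print(tag_untagged_pair("joint-venture", "joint-venture", vf_database))
print(tag_untagged_pair("producing-car", "produce-car", vf_database))
```

The sketch is meant to run only on pairs that local context tagging left as "??", so local context rules keep their priority over corpus preference.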

While specific embodiments of the invention have
been illustrated and described herein, it is realized that
modifications and changes will occur to those skilled in
the art. It is therefore to be understood that the appended
claims are intended to cover all such modifications
and changes as fall within the true spirit and scope
of the invention.

What is claimed is: 

1. A method for performing thematic part-of-speech 
tagging for collocations having content-word pairs in a 
natural language text processing system comprising the
steps of: 



identifying collocations of content-word pairs in a 
large corpus of text; 

calculating, for each of said collocation content-word 
pair identified, a variability factor which is a mea- 
sure of variability of said collocation content-word 
pairs occurring in said text; 

storing said collocation content word pairs and asso- 
ciated variability factors in a collocation database; 
and 

using said database to tag collocation content-word 
pairs according to said variability factors, wherein 
collocation content-word pairs with high variabil- 
ity factors are tagged as having a verb and a noun 
thereat and collocation content-word pairs with 
low variability factors are tagged as having an 
adjective and a noun thereat or a noun and noun 
thereat.

2. The method of claim 1 wherein a collocation con- 
tent-word pair and associated variability factor are 
stored in said database when the mutual information 
score for said collocation content-word pair is above a 
selected threshold. 

3. The method of claim 1 wherein said high variabil- 
ity factors exceed 0.75 and said low variability factors 
are less than or equal to 0.75. 

4. The method of claim 1 comprising the additional 
step of using local context analysis to tag collocation 
content word pairs before using said collocation data- 
base. 

***** 